Good Setup for our use-case

Inactive-Member-29214746 · May 15, 2020, 2:04pm

Hi,
we are currently thinking about changing how we use PagerDuty and need some input on how we can achieve what we want.

Current state:
We have a few different services that describe different environments of our product (staging, production-eu, production-us, global-infrastructure). We have one team with an on-call schedule that rotates on a weekly basis for all of these “services”. Each alerting tool reports to the right PagerDuty service based on the environment.

What we want to have:
We want the current team and schedule to be only responsible for production alerts. For alerts in staging we want to integrate our development teams so they will get alerts based on their responsibility for microservices.

Example:
Team A (Dev-Team): microserviceA, microserviceB
Team B (Dev-Team): microserviceC, microserviceD
Team C (Dev-&SRE-Team): microserviceE, microserviceF

Alert of microserviceA in staging should be routed to Team A
Alert of microserviceC in production-eu should be routed to Team C

We want to do this because teams should be sensitive to alerts in our development environment (staging). Currently Team C has to forward each staging alert manually within our organization's chat tool. This does not scale and prevents the Dev-Teams from taking responsibility for their services.

Problem we are facing:
We are unsure how to set up the structure in PagerDuty. Defining a service per microservice seems to be the right approach, but we are unsure how to route alerts from different environments to different teams.

Thanks in advance for any help!
Best, Jakob

Inactive-Member-27144877 · May 15, 2020, 2:04pm

Hi Jakob,

Thanks for reaching out on our community page. What you describe sounds very achievable. The best way to do this would be to use our Rulesets feature.

You would need to create a service for each microservice, with corresponding Escalation Policies and schedules for the teams that need to be alerted about, for example, the Staging environment. You can then use Rulesets to create event rules so that incoming events are directed to the right team. This has the benefit that all events can be sent with one routing key.

Hope this helps, let us know if you have any more questions.

John

dmcclure · May 16, 2020, 3:20pm

Jakob,

I would encourage you to broaden your thinking a bit. What are the business-aligned products/offerings/services/applications that those microservices enable or support? I would think about adding a few more layers of business and/or technical services in PagerDuty that help you not only mobilize the right team(s) but also understand the impact on your business and/or customers, where those impacts are, and how severe they may be. That also positions you for the future with a well-configured PagerDuty foundation.

Then you can start to onboard alerts and route them into the right PagerDuty Technical Service and related teams using this new context. The key to success with this is that each and every one of your incoming monitoring events/alerts contains rich metadata (tags, labels, structured host/node name, etc.) so you can easily identify what the event/alert is from, what it impacts, and what PagerDuty Technical Service you should route it to. In your example, I’d want to see each event/alert contain something like this: “Business Service: eCommerce, Business Application: Web Commerce, Function/Microservice: Check Inventory, Environment: Production”. Then I’d create a rule in my ruleset for the eCommerce Application (or team) to route that event to the matching service. For the non-production case, that would be something like eCommerce:Web:Check Inventory:Dev, notifying the on-call dev team (using low urgency notifications).
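To make the metadata idea concrete, here is a minimal sketch of an event carrying that context. It assumes the Events API v2 payload shape (routing_key, event_action, payload with custom_details); the key names under custom_details, the source host, and the routing key are placeholders chosen for illustration, not values from this thread.

```python
import json


def build_event(routing_key, summary, source, severity, metadata):
    """Build a trigger event; the routing metadata travels in custom_details."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,  # one of: critical, error, warning, info
            "custom_details": dict(metadata),
        },
    }


event = build_event(
    routing_key="YOUR_RULESET_ROUTING_KEY",  # placeholder
    summary="Check Inventory latency above threshold",
    source="check-inventory-prod-eu-01",  # hypothetical structured host name
    severity="warning",
    metadata={
        "business_service": "eCommerce",
        "business_application": "Web Commerce",
        "microservice": "Check Inventory",
        "environment": "Production",
    },
)

# Print the JSON body a monitoring tool would send.
print(json.dumps(event, indent=2))
```

With every event shaped like this, a ruleset rule only has to match on one or two well-known fields instead of parsing free-text summaries.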

Make sense?

Doug

Inactive-Member-29214746 · May 20, 2020, 10:23am

Thank you for your replies!

Inactive-Member-29214746 · July 1, 2020, 1:07pm

Hi @dmcclure and @Inactive-Member-27144877,
today I had the time to try out your suggestions. But I fail to get it working using Rulesets.

The problem is that I can only assign one escalation policy per service. But I want only alerts from staging systems to be routed to the development teams. The operations team should receive all alerts of each service in production systems.

How can I solve this?

And another question: if I create a service in PagerDuty for each microservice, how can I create a maintenance window for alerts that happen between X and Y, but only in production-eu (and not production-us)?

Best,
Jakob

dmcclure · July 2, 2020, 12:15am

Jakob,

Yes, the limitation is 1:1, service to EP. You’d need to have an actual PagerDuty Technical Service for “serviceA-prod” and another for “serviceA-non-prod”, where you place the production support team’s EP on “serviceA-prod” and the development team’s EP on “serviceA-non-prod”. Then in your rulesets, you’d need to create rules that identify incoming events by some event metadata, like a pattern in the hostname, a tag/label or similar, and route them into the appropriate service. So all of your “serviceA-prod” alerts might have a tag/label in the alert named “Env:Production” and you’d look for that as the routing condition to the production service.
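The routing decision itself can be sketched in a few lines. This is only an illustration of the logic a ruleset rule would encode, not PagerDuty configuration; the escalation policy names and the team mapping are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    service: str
    escalation_policy: str


# Hypothetical names: one EP for production support, one per dev team.
PRODUCTION_SUPPORT_EP = "production-support-ep"
DEV_TEAM_EP = {"serviceA": "team-a-ep"}


def route(microservice, labels):
    """Pick the prod or non-prod service based on the Env label."""
    if "Env:Production" in labels:
        return Route(f"{microservice}-prod", PRODUCTION_SUPPORT_EP)
    return Route(f"{microservice}-non-prod", DEV_TEAM_EP[microservice])


prod = route("serviceA", {"Env:Production", "Region:eu"})
staging = route("serviceA", {"Env:Staging"})
print(prod)
print(staging)
```

Each microservice gets two services, and the environment label alone decides which one (and therefore which EP) receives the event.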

For maintenance windows, you have two options. First, you could put the entire PagerDuty Technical Service into maintenance, which would suppress all alerts destined for that service. Alternatively, you can create very specific time-based ruleset rules for a given node, cluster, container, microservice, etc. (same approach as above) and only suppress those during a given fixed or recurring time window. The best practice would be this fine-grained approach, with the rules set via API and ideally linked to a central change/maintenance process.
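As a rough sketch of the fine-grained option, the check below suppresses events only when they come from a given environment inside a fixed window. It models the rule logic only and does not call any PagerDuty API; the window times and the field names are assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical maintenance window covering production-eu only.
WINDOWS = [
    {
        "environment": "production-eu",
        "start": datetime(2020, 7, 4, 22, 0, tzinfo=timezone.utc),
        "end": datetime(2020, 7, 5, 2, 0, tzinfo=timezone.utc),
    },
]


def should_suppress(environment, event_time, windows=WINDOWS):
    """True if the event falls inside a window defined for its environment."""
    return any(
        w["environment"] == environment and w["start"] <= event_time < w["end"]
        for w in windows
    )


during = datetime(2020, 7, 4, 23, 30, tzinfo=timezone.utc)
print(should_suppress("production-eu", during))  # True
print(should_suppress("production-us", during))  # False
```

Because the condition includes the environment, production-us alerts for the same microservice keep paging during the production-eu window.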

Let me know if this helps!